Abstract — In this paper, we propose a distributive queue-aware
intra-cell user scheduling and inter-cell interference (ICI)
management control design for a delay-optimal cellular downlink
system with M base stations (BSs), and K users in each cell.
Each BS has K downlink queues for K users respectively 
with heterogeneous arrivals and delay requirements. The ICI 
management control is adaptive to joint queue state information 
(QSI) over a slow time scale, while the user scheduling control 
is adaptive to both the joint QSI and the joint channel state 
information (CSI) over a faster time scale. We show that the 
problem can be modeled as an infinite horizon average cost 
Partially Observed Markov Decision Problem (POMDP), which 
is NP-hard in general. By exploiting the special structure of the 
problem, we shall derive an equivalent Bellman equation to solve 
the POMDP problem. To address the distributive requirement 
and the issue of dimensionality and computation complexity, we 
derive a distributive online stochastic learning algorithm, which 
only requires local QSI and local CSI at each of the M BSs. We 
show that the proposed learning algorithm converges almost- 
surely (with probability 1) and has significant gain compared 
with various baselines. The proposed solution only has linear
complexity order O(MK).

Index Terms — multi-cell systems, delay optimal control, par- 
tially observed Markov decision problem (POMDP), interference 
management, stochastic learning. 

I. Introduction 

It is well known that cellular systems are interference
limited, and a large body of work addresses inter-cell
interference (ICI) in cellular systems. Specifically, the optimal
binary power control (BPC) for sum rate maximization
has been studied in [1], where BPC was shown to provide
reasonable performance compared with multi-level power
control in the multi-link system. In [2], the authors studied a
joint adaptive multi-pattern reuse and intra-cell user scheduling
scheme to maximize the long-term network-wide utility. The
ICI management runs at a slower time scale than the user selection
strategy to reduce the communication overhead. In [3] and the
references therein, cooperation or coordination is also shown to
be a useful tool to manage ICI and improve the performance
of the cellular network.

However, all of these works assume that there are
infinite backlogs at the transmitter, and the control policy is
only a function of channel state information (CSI). In practice,
applications are delay sensitive, and it is critical to optimize
the delay performance in the cellular network. A systematic
approach to delay-optimal resource control in the
general delay regime is the Markov Decision Process (MDP)
technique. In [4], [5], the authors applied it to obtain the
delay-optimal cross-layer control policy for the broadcast channel
and the point-to-point link respectively. However, very
few works have studied the delay-optimal control problem
in the cellular network. Most existing works simply propose
heuristic control schemes with partial consideration of the
queuing delay [6]. As we shall illustrate, delay-optimal control
of the cellular network involves several technical challenges.

• Curse of Dimensionality: Although the MDP technique
is the systematic approach to solve the delay-optimal
control problem, a primary difficulty is the curse of
dimensionality [7]. For example, a huge state space (exponential
in the number of users and number of cells) will be
involved in the MDP, and brute-force value or policy
iterations cannot lead to any implementable solution¹ [8],
[9]. Furthermore, brute-force solutions require explicit
knowledge of the transition probability of system states,
which is difficult to obtain in complex systems.

• Complexity of the Interference Management: Jointly 
optimal ICI management and user scheduling requires 
heavy computation overhead even for the throughput 
optimization problem Q. Although grouping clusters of 
cells [1 1 and considering only neighboring BSs [ 1 1 were 
proposed to reduce the complexity, complex operations
on a slot-by-slot basis are still required, which is not
suitable for practical implementation.

• Decentralized Solution: For delay-optimal multi-cell
control, the entire system state is characterized by the
global CSI (CSI from any BS to any MS) and the global
QSI (queue length of all users). Such system state
information is distributed locally at each BS, and a centralized
solution (which requires global knowledge of the CSI and
QSI) will induce substantial signaling overhead between
the BSs and the Base Station Controller (BSC).

In this paper, we consider delay-optimal ICI
management control and intra-cell user scheduling for the
cellular system. For implementation considerations, the ICI
management control is computed at the BSC at a longer time
scale and it is adaptive to the QSI only. On the other hand, the
intra-cell user scheduling control is computed distributively at
the BS at a smaller time scale and hence, it is adaptive to both
the CSI and QSI. Due to the two time-scale control structure,
the delay-optimal control is formulated as an infinite-horizon
average cost Partially Observed Markov Decision Process
(POMDP). Exploiting the special structure, we propose an
equivalent Bellman equation to solve the POMDP. Based on
the equivalent Bellman equation, we propose a distributive
online learning algorithm to estimate a per-user value function
as well as a per-user Q-factor². Only the local CSI and
QSI information is required in the learning process at each
BS. We also establish the almost-sure
convergence of the proposed distributive learning algorithm.
The proposed algorithm is quite different from the iterative
update algorithm for solving the deterministic NUM [12],
where the CSI is always assumed to be quasi-static during
the iterative updates. However, the delay-optimal problem we
consider is stochastic in nature, and during the iterative
updates, the system state will not be quasi-static anymore.
In addition, the proposed learning algorithm is also quite
different from conventional stochastic learning [11], [13]. For
instance, conventional stochastic learning requires centralized
update and global system state knowledge, and the convergence
proof follows from standard contraction mapping arguments.
However, due to the distributive learning requirement and
simultaneous learning of the per-user value function and
Q-factor, it is not trivial to establish the contraction mapping
property and the associated convergence proof. We also
illustrate the performance gain of the proposed solution against
various baselines via numerical simulations. Furthermore, the
solution has linear complexity order O(MK) and it is quite
suitable for practical implementation.

1 For a cellular system with 5 BSs, 5 users served by each BS, a buffer size
of 5 per user and 5 CSI states for each link between one user and one BS, the
system state space contains $(5+1)^{5 \times 5} \times 5^{5 \times 5 \times 5}$ states, which is already
unmanageable.

II. System Model 

In this section, we shall elaborate the system model, as
well as the control policies. We consider the downlink of a
wireless cellular network consisting of $M$ BSs, and there are
$K$ mobile users in each cell served by one BS. Specifically,
let $\mathcal{M} = \{1, \ldots, M\}$ and $\mathcal{K}_m = \{1, \ldots, K\}$ denote the set of
BSs and the set of users served by BS $m$ respectively;
$k \in \mathcal{K}_m$ denotes the $k$-th user served by BS $m$. The time
dimension is partitioned into scheduling slots (every slot lasts
for $\tau$ seconds). The system model is illustrated in Fig. 1.

A. Source Model 

In each BS, there are $K$ independent application streams
dedicated to the $K$ users respectively. Let $\mathbf{A}(t) = \{\mathbf{A}_m(t)\}_{m=1}^{M}$
and $\mathbf{A}_m(t) = \{A_{(m,k)}(t)\}_{k=1}^{K}$, where $A_{(m,k)}(t)$
represents the new arrivals (number of bits) for user
$k \in \mathcal{K}_m$ at the end of slot $t$.

Assumption 1 (Assumption on Source Model): We assume
that the arrival process $A_{(m,k)}(t)$ is i.i.d. over the scheduling
slot $t$ according to a general distribution $\Pr\{A_{(m,k)}\}$ with
average arrival rate $\lambda_{(m,k)} = \mathbb{E}[A_{(m,k)}]$, and the arrival
processes for all the users are independent of each other,
i.e., $\Pr\{A_{(m,k)}, A_{(n,l)}\} = \Pr\{A_{(m,k)}\}\Pr\{A_{(n,l)}\}$ if $m \neq n$ or
$k \neq l$. ■

Fig. 1. Physical layer and queueing model of cellular network.

Let $\mathbf{Q}(t) = \{\mathbf{Q}_m(t)\}_{m=1}^{M} \in \mathcal{Q}$ denote the global QSI in
the system, where $\mathcal{Q}$ is the state space for the global QSI.
$\mathbf{Q}_m(t) = \{Q_{(m,k)}(t)\}_{k=1}^{K}$ denotes the QSI in BS $m$, where
$Q_{(m,k)}(t)$ represents the number of bits for user $k \in \mathcal{K}_m$ at
the beginning of slot $t$, and $N_Q$ denotes the maximal buffer
size (bits). When the buffer is full, i.e., $Q_{(m,k)} = N_Q$, new bit
arrivals will be dropped. The cardinality of the global QSI is
$I_Q = (1 + N_Q)^{MK}$.

2 The Q-factor $\mathbb{Q}(s, a)$ is a function of the system state $s$ and the control
action $a$, which represents the potential cost of applying a control action $a$ at
the current state $s$ and applying the action $a' = \arg\min_{a} \mathbb{Q}(s', a)$ for any
system state $s'$ in the future [11].

B. Channel Model and Physical Layer Model 

Let $H^{n}_{(m,k)}(t)$ and $L^{n}_{(m,k)}$ denote the small-scale channel
fading gain and the path loss from the $n$-th BS to user
$k \in \mathcal{K}_m$ respectively, and $\mathbf{H}_{(m,k)}(t) = \{H^{n}_{(m,k)}(t)\}_{n=1}^{M}$ is the
local CSI state for user $k$. $\mathbf{H}_m(t) = \{\mathbf{H}_{(m,k)}(t)\}_{k=1}^{K}$ denotes
the local CSI state for BS $m$, and the global CSI is denoted
as $\mathbf{H}(t) = \{\mathbf{H}_m(t)\}_{m=1}^{M} \in \mathcal{H}$, where $\mathcal{H}$ is the state space for
the global CSI.

Assumption 2 (Assumption on Channel Model): We
assume that the global CSI $\mathbf{H}$ is quasi-static in each slot.
Furthermore, $H^{n}_{(m,k)}(t)$ is i.i.d. over the scheduling slot $t$
according to a general distribution $\Pr\{H^{n}_{(m,k)}\}$, and the small-scale
channel fading gains for all users are independent of
each other. The path loss $L^{n}_{(m,k)}$ remains constant for the
duration of the communication session. ■

The cellular system shares a single common channel with
bandwidth $W$ Hz (all the BSs use the same channel). At the
beginning of each slot, the BS is either turned on (with transmit
power $P^{m}_{\max}$) or off (with transmit power 0)³ according
to an ICI management control policy, which is defined later.
At each slot, a BS can select only one user for its data
transmission. Specifically, let $\mathbf{p} = \{p_m\}_{m=1}^{M} \in \mathcal{P}$ denote
an ICI management control pattern, where $p_m = 1$ denotes that
BS $m$ is active and $p_m = 0$ otherwise, and $\mathcal{P}$ denotes the set of
all possible control patterns. Furthermore, let $\mathcal{M}_{\mathbf{p}} \subseteq \mathcal{M}$ be
the set of BSs activated by the pattern $\mathbf{p}$ and $\mathcal{P}_m \subseteq \mathcal{P}$ be the
set of patterns that activate BS $m$. The signal received by
user $k \in \mathcal{K}_m$ at slot $t$, when pattern $\mathbf{p} \in \mathcal{P}_m$ is selected,
is given by

$$y_{(m,k)}[t] = \sqrt{L^{m}_{(m,k)} H^{m}_{(m,k)}}\, x_m[t] + \sum_{n \in \mathcal{M}_{\mathbf{p}},\, n \neq m} \sqrt{L^{n}_{(m,k)} H^{n}_{(m,k)}}\, x_n[t] + z_{(m,k)}[t] \tag{1}$$

where $x_m[t]$ is the transmit signal from the $m$-th BS to the
$k$-th user at slot $t$, and $\{z_{(m,k)}[t]\}$ is the i.i.d. $\mathcal{N}(0, N_0)$ noise.
The achievable data rate of user $k$ can be expressed by

$$R_{(m,k)} = \begin{cases} W \log_2\!\left(1 + \dfrac{\xi P^{m}_{\max} L^{m}_{(m,k)} H^{m}_{(m,k)}}{I_{(m,k)}}\right) s_{(m,k)} & \text{if } \mathbf{p} \in \mathcal{P}_m \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

where $I_{(m,k)} = \sum_{n \neq m} p_n P^{n}_{\max} L^{n}_{(m,k)} H^{n}_{(m,k)} + N_0 W$ is the power sum of
interference and noise, $s_{(m,k)} \in \{0, 1\}$ is an indicator variable with
$s_{(m,k)} = 1$ when user $k$ is scheduled, and $\xi \in (0, 1]$ is a constant
that can be used to model both the coded and uncoded systems [5].

3 Note that the on-off BS control is shown to be close to optimal in [1],
[2]. Moreover, the solution framework can be easily extended to deal with
discrete BS power control.
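The rate expression in (2) can be illustrated with a short Python sketch. The function name, argument names and the list-based data layout are assumptions made for illustration only. The SINR is taken to be signal power over noise power $N_0 W$ plus the interference from the other active BSs, which is one reading of the definitions above rather than a verified transcription of the original formula.

```python
import math


def achievable_rate(m, pattern, scheduled, p_max, power_gain,
                    bandwidth_hz, noise_psd, xi=1.0):
    """Rate (bits/s) of a user served by BS m under an on/off ICI pattern.

    Hypothetical layout: pattern[n] is 1 if BS n is active, p_max[n] is its
    transmit power, and power_gain[n] stands for the product of path loss and
    small-scale fading gain from BS n to this user.
    """
    if not pattern[m] or not scheduled:
        return 0.0  # BS m is switched off, or this user is not scheduled
    signal = p_max[m] * power_gain[m]
    # Interference comes only from the other BSs activated by the pattern.
    interference = sum(p_max[n] * power_gain[n]
                       for n in range(len(pattern))
                       if n != m and pattern[n])
    sinr = signal / (noise_psd * bandwidth_hz + interference)
    return bandwidth_hz * math.log2(1.0 + xi * sinr)
```

Switching a neighbouring BS on can only lower the rate of a scheduled user, which is the coupling that the ICI management pattern is meant to control.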



C. ICI Management and User Scheduling Control Policy 

At the beginning of the slot, the BSC will decide which
BSs are allowed to transmit according to a stationary ICI
management control policy defined below.

Definition 1 (Stationary ICI Management Control Policy):
A stationary ICI management control policy $\Omega_p : \mathcal{Q} \to \mathcal{P}$ is
defined as the mapping from the current global QSI to an ICI
management pattern $\Omega_p(\mathbf{Q}) = \mathbf{p}$. ■

Let $\chi(t) = \{\mathbf{H}(t), \mathbf{Q}(t)\}$ be the global system state at
the beginning of slot $t$. The active user in each cell is selected
according to a user scheduling policy defined below.

Definition 2 (Stationary User Scheduling Policy): A
stationary user scheduling policy $\Omega_s : \{\mathcal{Q}, \mathcal{H}\} \to \mathcal{S}$ is defined
as the mapping from the current global system state $\chi$ to the current
user scheduling action $\Omega_s(\chi) = \mathbf{s} \in \mathcal{S}$. The scheduling action
$\mathbf{s}$ is the set of all the users' scheduling indicator variables, i.e.,
$\mathbf{s} = \{s_{(m,k)}, \forall k \in \mathcal{K}_m, \forall m\}$. It represents which users are
scheduled and which users are not in any given slot. $\mathcal{S}$ is the
set of all user scheduling actions. ■

For notational convenience, let $\Omega = \{\Omega_p, \Omega_s\}$ be the joint
control policy, and $\Omega(\chi) = \{\mathbf{p}, \mathbf{s}\}$ be the control action under
state $\chi$.

III. Problem Formulation 

In this section, we will first elaborate the dynamics of the
system state under a control policy $\Omega$. Based on that, we shall
formally formulate the delay-optimal control problem.

A. Dynamics of System State 

Given the new arrival $A_{(m,k)}(t)$ at the end of slot $t$, the
current system state $\chi(t)$ and the control action $\Omega(\chi(t))$, the
queue evolution for user $k \in \mathcal{K}_m$ is given by:

$$Q_{(m,k)}(t+1) = \left[\left(Q_{(m,k)}(t) - U_{(m,k)}(t)\right)^{+} + A_{(m,k)}(t)\right]_{\wedge N_Q} \tag{3}$$

where $U_{(m,k)}(t) = \lfloor R_{(m,k)}(\chi(t), \Omega(\chi(t)))\,\tau \rfloor$ is the number
of bits delivered to user $k$ at slot $t$, and $R_{(m,k)}(\chi(t), \Omega(\chi(t)))$,
given by (2), is the achievable data rate under the control
action $\Omega(\chi(t))$. $\lfloor x \rfloor$ denotes the floor of $x$, $(x)^{+} = \max(x, 0)$,
and $(x)_{\wedge N_Q} = \min(x, N_Q)$. Let $\mathbf{U}(t) = \{\mathbf{U}_m(t)\}_{m=1}^{M}$ and
$\mathbf{U}_m(t) = \{U_{(m,k)}(t)\}_{k=1}^{K}$, so that
$\mathbf{Q}(t+1) = \left[(\mathbf{Q}(t) - \mathbf{U}(t))^{+} + \mathbf{A}(t)\right]_{\wedge N_Q}$. Therefore, given a control
policy $\Omega$, the random process $\{\mathbf{H}(t), \mathbf{Q}(t)\}$ is a controlled
Markov chain with transition probability

$$\Pr\{\chi(t+1) \mid \chi(t), \Omega(\chi(t))\} = \begin{cases} \Pr\{\mathbf{H}(t+1)\}\Pr\{\mathbf{A}(t)\} & \text{if } \mathbf{Q}(t+1) = \left[(\mathbf{Q}(t) - \mathbf{U}(t))^{+} + \mathbf{A}(t)\right]_{\wedge N_Q} \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

B. Delay Optimal Control Problem Formulation

Given a stationary control policy $\Omega$, the average cost of
user $k \in \mathcal{K}_m$ is given by:

$$\bar{T}_{(m,k)}(\Omega) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}\left[f(Q_{(m,k)}(t))\right] \tag{5}$$

where $f(Q_{(m,k)})$ is a monotonic increasing cost function of
$Q_{(m,k)}$. For example, when $f(Q_{(m,k)}) = Q_{(m,k)}/\lambda_{(m,k)}$,
using Little's Law [14], $\bar{T}_{(m,k)}(\Omega)$ is an approximation⁴ of
the average delay of user $k$. When $f(Q_{(m,k)}) = \mathbf{1}[Q_{(m,k)} \geq N_Q]$
and $A_{(m,k)}$ follows the Bernoulli process, $\bar{T}_{(m,k)}(\Omega)$ is the bit
dropping probability (conditioned on bit arrival). Note that
the $MK$ queues in the cellular system are coupled together
via the control policy $\Omega$. In this paper, we seek to find an
optimal stationary control policy $\Omega$ to minimize the average
cost in (5). Specifically, we have:

Problem 1 (Delay Optimal Multi-cell Control Problem):⁵
For some positive constants $\beta = \{\beta_{(m,k)}, \forall k \in \mathcal{K}_m, \forall m\}$,
find a stationary control policy $\Omega$ that minimizes:

$$\min_{\Omega} J^{\Omega}_{\beta} = \sum_{m,k} \beta_{(m,k)} \bar{T}_{(m,k)}(\Omega) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}\left[g(\chi(t), \Omega(\chi(t)))\right] \tag{6}$$

where $g(\chi(t), \Omega(\chi(t))) = \sum_{m,k} \beta_{(m,k)} f(Q_{(m,k)}(t))$ is the per-slot
cost, and $\mathbb{E}^{\Omega}$ denotes the expectation w.r.t. the induced
measure (induced by the control policy $\Omega$ and the transition
kernel in (4)). The positive constants $\beta$ indicate the relative
importance of the users and, for a given $\beta$, the solution to (6)
corresponds to a Pareto optimal point of the multi-objective
optimization problem given by $\min_{\Omega} \bar{T}_{(m,k)}(\Omega), \forall m, k$. Moreover,
a control policy $\Omega^{*}$ is called Pareto optimal if for any control
policy $\Omega' \neq \Omega^{*}$ such that $\bar{T}_{(m,k)}(\Omega') \leq \bar{T}_{(m,k)}(\Omega^{*}), \forall m, k$, it
implies that $\bar{T}_{(m,k)}(\Omega') = \bar{T}_{(m,k)}(\Omega^{*}), \forall m, k$. In other words,
we cannot reduce $\bar{T}_{(m_1,k_1)}$ without increasing another component
(say $\bar{T}_{(m_2,k_2)}$) at the Pareto optimal control $\Omega^{*}$ [15].

4 Strictly speaking, the average delay is given by
$\bar{T}_{(m,k)}(\Omega) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}\left[\frac{Q_{(m,k)}(t)}{\lambda_{(m,k)}(1 - \mathrm{PBD}_{(m,k)})}\right]$, where $\mathrm{PBD}_{(m,k)}$
is the bit dropping probability conditioned on bit arrival.
Since our target bit dropping probability $\mathrm{PBD}_{(m,k)} \ll 1$,
$\bar{T}_{(m,k)}(\Omega) \approx \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}\left[\frac{Q_{(m,k)}(t)}{\lambda_{(m,k)}}\right]$.

5 In fact, the proposed solution framework can be easily extended to
deal with a more general QoS based optimization. For example, say
we minimize the average delay subject to constraints on the average
data rate $\bar{R}_{(m,k)}(\Omega) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}\left[R_{(m,k)}(t)\right]$, which is required
to be no smaller than a per-user target.
The Lagrangian of such constrained optimization is:
$\min_{\Omega} \sum_{m,k} \left[\beta_{(m,k)} \bar{T}_{(m,k)}(\Omega) + \mu_{(m,k)} \bar{R}_{(m,k)}(\Omega)\right] = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}\left[g_{\mu}(\chi(t), \Omega(\chi(t)))\right]$,
where $g_{\mu} = \sum_{m,k} \left[\beta_{(m,k)} f(Q_{(m,k)}) + \mu_{(m,k)} R_{(m,k)}\right]$
and $\mu_{(m,k)}$ is the Lagrange multiplier corresponding to the QoS constraint
on $\bar{R}_{(m,k)}(\Omega)$. Note that it has the same form as (6) and the proposed
solution framework can be applied to the QoS constrained problem as well.

TO BE APPEARED IN IEEE TRANS. WIRELESS COMMUN.

IV. General Solution to the Delay Optimal 
Problem 

In this section, we will show that the delay-optimal Problem 1
can be modeled as an infinite horizon average cost POMDP,
which is a very difficult problem. By exploiting the special
structure, we shall derive an equivalent Bellman equation to
solve the POMDP problem.

A. Preliminary on MDP and POMDP 

An infinite horizon average cost MDP can be characterized
by a tuple of four objects: $\{\mathbb{S}, \mathbb{A}, \Pr\{s'|s, a\}, g(s, a)\}$, where
$\mathbb{S}$ is a finite set of states and $\mathbb{A}$ is the action space. $\Pr\{s'|s, a\}$
is the transition probability from state $s$ to $s'$, given that the
action $a \in \mathbb{A}$ is taken. $g(s, a)$ is the per-slot cost function.
The objective is to find the optimal policy $\mathbf{a} = \{a(s)\}$ so as
to minimize the average per-slot cost $\theta$ as:

$$\theta = \min_{\mathbf{a}} \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\mathbf{a}}\left[g(s(t), a(s(t)))\right] \tag{7}$$

If the policy space consists of unichain policies and the
associated induced Markov chain is irreducible, it is well
known that there exists a unique $\theta$ for each starting state
[11]. Furthermore, the optimal control policy $\mathbf{a}$ can be obtained
from the following Bellman equation:

$$V(s) + \theta = \min_{a(s)} \left\{ g(s, a(s)) + \sum_{s'} \Pr\{s'|s, a(s)\} V(s') \right\} \tag{8}$$

where $V(s)$ is called the value function. General offline
solutions, value or policy iteration, can be used to find the
value function $V(s)$ iteratively, as well as the optimal policy.
POMDP is an extension of MDP when the control agent
does not have direct observation of the entire system state (and
hence it is called "partially observed MDP"). Specifically, an
infinite horizon average cost POMDP can be characterized by
a tuple $\{\mathbb{S}, \mathbb{A}, \Pr\{s'|s, a\}, g(s, a), \mathbb{O}, O(z, s, a)\}$,
where $\{\mathbb{S}, \mathbb{A}, \Pr\{s'|s, a\}, g(s, a)\}$ characterize an MDP and $\mathbb{O}$
is a finite set of observations. $O(z, s, a)$ is the observation
function, which gives the probability (or stochastic relationship)
between the partial observation $z$, the actual system
state $s$ and the control action $a$. Specifically, $O(z, s, a)$ is the
probability of getting a partial observation $z$ given that the
current system state is $s$ and the action $a$ was taken in the
previous slot. A POMDP is an MDP in which the actions are chosen
based on the observation $z$ rather than on the current system state. The objective
is to find the optimal policy $\mathbf{a} = \{a(z)\}$ so as to minimize the
average per-slot cost $\theta$ in (7). However, in general, it is an NP-hard
problem and there are various approximation solutions
proposed based on the special structure of the studied problems.

B. Equivalent Bellman Equation and Optimal Control Policy 

In this subsection, we shall first illustrate that the optimization
Problem 1 is an infinite horizon average cost POMDP. We
shall then exploit some special problem structure to simplify
the complexity and derive an equivalent Bellman equation to
solve the problem. For instance, in the delay-optimal Problem 1,
the ICI management control policy $\Omega_p$ is adaptive to the
QSI $\mathbf{Q}$, while the user scheduling policy $\Omega_s$ is adaptive to the
complete system state $\{\mathbf{Q}, \mathbf{H}\}$. Therefore, the optimal control
policy $\Omega^{*}$ cannot be obtained by solving a standard Bellman
equation from conventional MDP⁶. In fact, Problem 1 is a
POMDP with the following specification.

• State Space: The system state is the global QSI and CSI
$\chi = \{\mathbf{Q}, \mathbf{H}\} \in \{\mathcal{Q}, \mathcal{H}\}$.

• Action Space: The action is the ICI management pattern and
user scheduling $\{\mathbf{p}, \mathbf{s}\} \in \{\mathcal{P}, \mathcal{S}\}$.

• Transition Kernel: The transition probability
$\Pr\{\chi'|\chi, \mathbf{p}, \mathbf{s}\}$ is given in (4).

• Per-Slot Cost Function: The per-slot cost function is
$g(\chi, \mathbf{p}, \mathbf{s}) = \sum_{m,k} \beta_{(m,k)} f(Q_{(m,k)})$.

• Observation: The observation for the ICI management control
policy is the global QSI, i.e., $z_p = \mathbf{Q}$, while the
observation for the user scheduling policy is the complete
system state, i.e., $z_s = \chi$.

• Observation Function: The observation function for the
ICI management control policy is $O_p(z_p, \chi, \mathbf{p}, \mathbf{s}) = 1$
if $z_p = \mathbf{Q}$, otherwise 0. Furthermore, the observation
function for the user scheduling policy is $O_s(z_s, \chi, \mathbf{p}, \mathbf{s}) = 1$
if $z_s = \chi$, otherwise 0.

While POMDP is a very difficult problem in general, we 
shall utilize the notion of action partitioning in our problem to 
substantially simplify the problem. We first define partitioned 
actions below. 

Definition 3 (Partitioned Actions): Given a control policy
$\Omega$, we define $\Omega(\mathbf{Q}) = \{(\mathbf{p}, \mathbf{s}) = \Omega(\chi) : \chi = (\mathbf{Q}, \mathbf{H})\ \forall \mathbf{H} \in \mathcal{H}\}$
as the collection of actions under a given $\mathbf{Q}$ for all possible
$\mathbf{H} \in \mathcal{H}$. The complete policy $\Omega$ is therefore equal to the union
of all partitioned actions, i.e., $\Omega = \bigcup_{\mathbf{Q}} \Omega(\mathbf{Q})$. ■

Based on the action partitioning, we can transform the
POMDP problem into a regular infinite-horizon average cost
MDP. Furthermore, the optimal control policy $\Omega^{*}$ can be
obtained by solving an equivalent Bellman equation, which
is summarized in the theorem below.

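Theorem 1 (Equivalent Bellman Equation): The optimal
control policy $\Omega^{*} = (\Omega_p^{*}, \Omega_s^{*})$ in Problem 1 can be obtained
by solving the equivalent Bellman equation given by:

$$V(\mathbf{Q}) + \theta = \min_{\Omega(\mathbf{Q})} \left\{ \tilde{g}(\mathbf{Q}, \Omega(\mathbf{Q})) + \sum_{\mathbf{Q}'} \Pr\{\mathbf{Q}'|\mathbf{Q}, \Omega(\mathbf{Q})\} V(\mathbf{Q}') \right\} \tag{9}$$

where $\tilde{g}(\mathbf{Q}, \Omega(\mathbf{Q})) = \sum_{m,k} \beta_{(m,k)} f(Q_{(m,k)})$ is the per-slot
cost function, and the transition kernel is given
by $\Pr\{\mathbf{Q}'|\mathbf{Q}, \Omega(\mathbf{Q})\} = \mathbb{E}_{\mathbf{H}}\left[\Pr\{\mathbf{Q}'|\mathbf{Q}, \mathbf{H}, \Omega(\chi)\}\right]$, where
$\Pr\{\mathbf{Q}'|\mathbf{Q}, \mathbf{H}, \Omega(\chi)\}$ is given by

$$\Pr\{\mathbf{Q}'|\mathbf{Q}, \mathbf{H}, \Omega(\chi)\} = \begin{cases} \Pr\{\mathbf{A}\} & \text{if } \mathbf{Q}' = \left[(\mathbf{Q} - \mathbf{U})^{+} + \mathbf{A}\right]_{\wedge N_Q} \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

where $\mathbf{U} = \{\mathbf{U}_m\}_{m=1}^{M}$, $\mathbf{U}_m = \{U_{(m,k)}\}_{k=1}^{K}$, and
$U_{(m,k)} = \lfloor R_{(m,k)}(\chi, \Omega(\chi))\,\tau \rfloor$ for $k \in \mathcal{K}_m$. Suppose
$\Omega^{*}(\mathbf{Q}) = \{\mathbf{p}^{*}(\mathbf{Q}), \bigcup_{\mathbf{H}} \mathbf{s}^{*}(\mathbf{Q}, \mathbf{H})\}$ is a solution that solves the Bellman
equation in (9); then the optimal control policy for the original
Problem 1 is given by $\Omega_p^{*} = \bigcup_{\mathbf{Q}} \{\mathbf{p}^{*}(\mathbf{Q})\}$ and
$\Omega_s^{*} = \bigcup_{\mathbf{Q}, \mathbf{H}} \{\mathbf{s}^{*}(\mathbf{Q}, \mathbf{H})\}$. The value function $V(\mathbf{Q})$ that solves (9)
is a component-wise monotonic increasing function. ■
Proof: Please refer to Appendix A. ■
Note that solving (9) will obtain an ICI management policy
$\Omega_p^{*}$ that is a function of the QSI $\mathbf{Q}$ and a user scheduling policy
$\Omega_s^{*}$ that is a function of the QSI and CSI $\{\mathbf{Q}, \mathbf{H}\}$. We shall
illustrate this with a simple example below.

6 The policy will be a function of the complete system state by solving a
standard Bellman equation.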

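Example 1: Suppose there are two BSs with equal transmit
power ($P^{m}_{\max} = P, \forall m$), and there are three ICI management
control patterns in $\mathcal{P}$, given by $\mathbf{p}_1 = \{p_1 = 1, p_2 = 0\}$
(BS 1 is active), $\mathbf{p}_2 = \{p_1 = 0, p_2 = 1\}$ (BS 2 is active)
and $\mathbf{p}_3 = \{p_1 = 1, p_2 = 1\}$ (both BSs are active). Assume
deterministic arrival where one bit will always arrive at each
slot, i.e., $\Pr\{A_{(m,k)} = 1\} = 1$. The number of users served
by each BS is $K = 2$. The path loss $L^{n}_{(m,k)} = 1$ for all
$\{k, n, m\}$, and the small-scale fading gain is chosen from
two values $\{H_g, H_b\}$ with equal probability. As a result, the
global CSI state space⁷ is $\mathcal{H} = \{H_g, H_b\}^{M^2 K}$. Note that the
cardinality of the CSI state space $\mathcal{H}$ is $|\mathcal{H}| = 2^{M^2 K} = 256$. Given
a realization of the global QSI $\mathbf{Q}$, the partitioned actions
(following Definition 3) are given by:

$$\Omega(\mathbf{Q}) = \{\mathbf{p}(\mathbf{Q}), \mathbf{s}(\mathbf{Q}, \mathbf{H}^{(1)}), \ldots, \mathbf{s}(\mathbf{Q}, \mathbf{H}^{(256)})\} \tag{11}$$

Using Theorem 1, the optimal partitioned action $\Omega^{*}(\mathbf{Q})$ is
given by solving the right hand side (RHS) of (9):

$$\Omega^{*}(\mathbf{Q}) = \arg\min_{\mathbf{p}(\mathbf{Q}),\, \{\mathbf{s}(\mathbf{Q}, \mathbf{H}^{(i)})\}_{i=1}^{256}} \sum_{\mathbf{Q}'} \sum_{\mathbf{H}^{(i)} \in \mathcal{H}} \Pr\{\mathbf{H}^{(i)}\} \Pr\{\mathbf{Q}'|\mathbf{Q}, \mathbf{H}^{(i)}, \mathbf{p}(\mathbf{Q}), \mathbf{s}(\mathbf{Q}, \mathbf{H}^{(i)})\} V(\mathbf{Q}') \tag{12}$$

where

$$\Pr\{\mathbf{Q}'|\mathbf{Q}, \mathbf{H}^{(i)}, \mathbf{p}(\mathbf{Q}), \mathbf{s}(\mathbf{Q}, \mathbf{H}^{(i)})\} = \begin{cases} 1 & \text{if } \mathbf{Q}' = \left[(\mathbf{Q} - \mathbf{U})^{+} + \mathbf{1}\right]_{\wedge N_Q} \\ 0 & \text{otherwise} \end{cases} \tag{13}$$

and $\mathbf{U} = \{U_{(1,1)}, U_{(1,2)}, U_{(2,1)}, U_{(2,2)}\}$ is the number of
departure bits. For a given ICI management control $\mathbf{p}(\mathbf{Q}) = \mathbf{p}$,
the optimal user scheduling policy $\{\mathbf{s}^{*}(\mathbf{Q}, \mathbf{H}^{(i)})\}$ is

$$\{\mathbf{s}^{*}(\mathbf{Q}, \mathbf{H}^{(i)})\} = \arg\min_{\{\mathbf{s}(\mathbf{Q}, \mathbf{H}^{(i)})\}_{i=1}^{256}} \sum_{\mathbf{Q}'} \sum_{\mathbf{H}^{(i)} \in \mathcal{H}} \Pr\{\mathbf{H}^{(i)}\} \Pr\{\mathbf{Q}'|\mathbf{Q}, \mathbf{H}^{(i)}, \mathbf{p}, \mathbf{s}(\mathbf{Q}, \mathbf{H}^{(i)})\} V(\mathbf{Q}') \tag{14}$$

Observe that the RHS of (14) is a decoupled objective function
w.r.t. the variables $\{\mathbf{s}(\mathbf{Q}, \mathbf{H}^{(i)})\}_{i=1}^{256}$ and hence, applying
standard decomposition theory,

$$\mathbf{s}^{*}(\mathbf{Q}, \mathbf{H}^{(i)}) = \arg\min_{\mathbf{s}(\mathbf{Q}, \mathbf{H}^{(i)})} \sum_{\mathbf{Q}'} \Pr\{\mathbf{Q}'|\mathbf{Q}, \mathbf{H}^{(i)}, \mathbf{p}, \mathbf{s}(\mathbf{Q}, \mathbf{H}^{(i)})\} V(\mathbf{Q}') \tag{15}$$

As a result, the optimal ICI management control policy $\mathbf{p}^{*}(\mathbf{Q})$
is given by:

$$\mathbf{p}^{*}(\mathbf{Q}) = \arg\min_{\mathbf{p}(\mathbf{Q})} \sum_{\mathbf{Q}'} \sum_{\mathbf{H}^{(i)} \in \mathcal{H}} \Pr\{\mathbf{H}^{(i)}\} \Pr\{\mathbf{Q}'|\mathbf{Q}, \mathbf{H}^{(i)}, \mathbf{p}(\mathbf{Q}), \mathbf{s}^{*}(\mathbf{Q}, \mathbf{H}^{(i)})\} V(\mathbf{Q}') \tag{16}$$

where $\mathbf{s}^{*}(\mathbf{Q}, \mathbf{H}^{(i)})$ given in (15) is the optimal user scheduling
policy under the ICI management control policy $\mathbf{p}(\mathbf{Q})$.
Using Theorem 1, the optimal ICI management control and
user selection control of the original Problem 1 for a CSI
realization $\mathbf{H}^{(i)}$ and QSI realization $\mathbf{Q}$ are given by $\mathbf{p}^{*}(\mathbf{Q})$
and $\mathbf{s}^{*}(\mathbf{Q}, \mathbf{H}^{(i)})$ respectively. ■

7 For the sake of easy discussion, we consider discrete state space in this
example. Yet, the proposed algorithms and convergence results in the paper
work for general continuous state space as well.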
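The two-stage structure of Example 1 (an inner per-CSI minimization over scheduling actions, then an outer minimization over ICI patterns of the CSI-averaged cost) can be sketched in Python for a fixed QSI realization. All names here are hypothetical; in particular, `expected_value(p, h, s)` is an assumed callable standing in for the sum over next states of the transition probability times the value function.

```python
def two_stage_minimization(patterns, csi_prob, sched_actions, expected_value):
    """Return (cost, best_pattern, best_scheduling_per_csi) for one QSI value.

    csi_prob maps each CSI realization h to its probability, and
    expected_value(p, h, s) is the expected next-state value when pattern p
    and scheduling action s are applied under CSI realization h.
    """
    best = None
    for p in patterns:
        # Inner stage: the objective decouples across CSI realizations, so
        # each scheduling action is chosen independently for every h.
        sched = {h: min(sched_actions, key=lambda s: expected_value(p, h, s))
                 for h in csi_prob}
        # Outer stage: average the resulting cost over the CSI distribution.
        cost = sum(csi_prob[h] * expected_value(p, h, sched[h])
                   for h in csi_prob)
        if best is None or cost < best[0]:
            best = (cost, p, sched)
    return best
```

The pattern depends only on the QSI while the scheduling action depends on both QSI and CSI, which mirrors the two time scales of the control.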

V. Distributive Value Function and Q-factor 
Online Learning 

The solution in Theorem 1 requires the knowledge of the
value function $V(\mathbf{Q})$. However, obtaining the value function is
not trivial, as solving the Bellman equation (9) involves solving
a very large system of nonlinear fixed point equations
(corresponding to each realization of $\mathbf{Q}$ in (9)). A brute-force
solution of $V(\mathbf{Q})$ requires huge complexity, centralized
implementation and knowledge of global CSI and QSI at the BSC.
This will also induce huge signaling overhead because the QSI
of all the users is maintained locally at the $M$ BSs. In this
section, we shall propose a decentralized solution via distributive
stochastic learning following the structure illustrated in
Fig. 2. Moreover, we shall prove that the proposed distributive
stochastic learning algorithm will converge almost surely.

A. Post-Decision State Framework 

In this section, we first introduce the post-decision state
framework, also used in [19] and the references therein,
to lay the groundwork for developing the online learning algorithm.
The post-decision state is defined to be the virtual system
state immediately after making an action but before the new
bits arrive. For example, suppose $\chi = \{\mathbf{Q}, \mathbf{H}\}$ is the state at the
beginning of some time slot (also called the pre-decision
state). After making an action $\Omega(\chi) = \{\mathbf{p}, \mathbf{s}\}$, the post-decision
state immediately after the action is $\tilde{\chi} = \{\tilde{\mathbf{Q}}, \mathbf{H}\}$, where the
transition to $\tilde{\mathbf{Q}}$ is given by $\tilde{\mathbf{Q}} = (\mathbf{Q} - \mathbf{U})^{+}$. If new arrivals $\mathbf{A}$
occur in the post-decision state, and the CSI changes to $\mathbf{H}'$,
then the system reaches the next actual state, i.e., the pre-decision
state $\chi' = \{[\tilde{\mathbf{Q}} + \mathbf{A}]_{\wedge N_Q}, \mathbf{H}'\}$.
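The split of one slot into a post-decision step (departures) and a pre-decision step (arrivals with buffer truncation) can be written for a single queue as the Python sketch below. The function and argument names are illustrative assumptions; queue lengths, departures and arrivals are taken to be non-negative integers in bits.

```python
def post_decision_queue(q, u):
    """Queue length right after serving u bits, before new bits arrive."""
    return max(q - u, 0)


def next_pre_decision_queue(q_post, arrivals, n_q):
    """Queue length at the start of the next slot; overflow bits are dropped."""
    return min(q_post + arrivals, n_q)
```

Composing the two functions reproduces the per-user queue recursion, while keeping the control-dependent part (departures) separate from the random part (arrivals).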

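Using the action partitioning and defining the value function
$\tilde{V}$ on the post-decision state $\tilde{\mathbf{Q}}$ (where the pre-decision state is
$\{\mathbf{Q} = [\tilde{\mathbf{Q}} + \mathbf{A}]_{\wedge N_Q}, \mathbf{H}\}$), $\tilde{V}$ will satisfy the post-decision
state Bellman equation [19]

$$\tilde{V}(\tilde{\mathbf{Q}}) + \theta = \sum_{\mathbf{A}} \Pr\{\mathbf{A}\} \min_{\Omega(\mathbf{Q})} \left\{ \tilde{g}(\mathbf{Q}, \Omega(\mathbf{Q})) + \sum_{\tilde{\mathbf{Q}}'} \Pr\{\tilde{\mathbf{Q}}'|\mathbf{Q}, \Omega(\mathbf{Q})\} \tilde{V}(\tilde{\mathbf{Q}}') \right\} \tag{17}$$

where $\tilde{g}(\mathbf{Q}, \Omega(\mathbf{Q})) = \sum_{m,k} \beta_{(m,k)} f(Q_{(m,k)})$,
$\Pr\{\tilde{\mathbf{Q}}'|\mathbf{Q}, \Omega(\mathbf{Q})\} = \mathbb{E}_{\mathbf{H}}\left[\Pr\{\tilde{\mathbf{Q}}'|\mathbf{Q}, \mathbf{H}, \Omega(\chi)\}\right]$, and $\tilde{\mathbf{Q}}'$
is the next post-decision state transited from $\mathbf{Q}$. As in Theorem 1,
$\tilde{V}(\tilde{\mathbf{Q}})$ is also a component-wise monotonic increasing
function. The optimal policy is obtained by solving the RHS
of the Bellman equation (17).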

B. Distributive User Scheduling Policy on the CSI Time Scale 

To reduce the size of the state space and to decentralize the
user scheduling, we approximate $\tilde{V}(\tilde{\mathbf{Q}})$ in (17) by the sum of
per-user post-decision state value functions⁸ $\tilde{V}_{(m,k)}(\tilde{Q}_{(m,k)})$,
i.e.,

$$\tilde{V}(\tilde{\mathbf{Q}}) \approx \sum_{m,k} \tilde{V}_{(m,k)}(\tilde{Q}_{(m,k)}) \tag{18}$$

where $\tilde{V}_{(m,k)}(\tilde{Q}_{(m,k)})$ is defined as the fixed point of the
following per-user fixed point equation:

$$\tilde{V}_{(m,k)}(\tilde{Q}_{(m,k)}) + \tilde{V}_{(m,k)}(\tilde{Q}^{I}_{(m,k)}) = \sum_{A_{(m,k)}} \Pr\{A_{(m,k)}\} \left[ \beta_{(m,k)} f(Q_{(m,k)}) + \sum_{\tilde{Q}'_{(m,k)}} \Pr\{\tilde{Q}'_{(m,k)}|Q_{(m,k)}, s_{(m,k)} = 1, \mathbf{p}^{I}_{m}\} \tilde{V}_{(m,k)}(\tilde{Q}'_{(m,k)}) \right] \tag{19}$$

where $Q_{(m,k)} = [\tilde{Q}_{(m,k)} + A_{(m,k)}]_{\wedge N_Q}$ is the pre-decision state,
$s_{(m,k)} = 1$ means that user $k$ is scheduled to transmit
at BS $m$, $\tilde{Q}^{I}_{(m,k)} \in \{0, \ldots, N_Q\}$ is a reference state and
$\mathbf{p}^{I}_{m} \in \mathcal{P}_m$ is a reference ICI management pattern (with
BS $m$ active). The per-user value function $\tilde{V}_{(m,k)}(\tilde{Q}_{(m,k)})$ is
obtained by the proposed distributive online learning algorithm
(explained in Section IV-D). Note that the state space for
the value function $\tilde{V}(\tilde{\mathbf{Q}})$ is substantially reduced from
$(N_Q + 1)^{MK}$ (exponential growth w.r.t. the number of all
mobile users $MK$) to $MK(N_Q + 1)$ (linear growth w.r.t. the
number of all mobile users).
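To make the size reduction concrete, the following Python sketch compares the number of entries in a joint value table with the total number of entries across the per-user tables. The function names are illustrative assumptions; the formulas are the two counts quoted in the text.

```python
def joint_value_table_size(num_bs, users_per_bs, n_q):
    """Entries needed to tabulate a value function over the joint QSI."""
    return (n_q + 1) ** (num_bs * users_per_bs)


def per_user_value_table_size(num_bs, users_per_bs, n_q):
    """Total entries when each user keeps its own table over its own queue."""
    return num_bs * users_per_bs * (n_q + 1)
```

For instance, with 5 BSs, 5 users per BS and a buffer size of 5, `joint_value_table_size(5, 5, 5)` evaluates `6 ** 25`, whereas `per_user_value_table_size(5, 5, 5)` is 150.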

Corollary 1 (Decentralized User Scheduling Actions):
Using the linear approximation in (18), the user scheduling
action of BS $m \in \mathcal{M}_{\mathbf{p}}$ under any given ICI management
pattern $\mathbf{p}$ (obtained by solving the RHS of the Bellman equation
(17)) is given by:

$$\mathbf{s}^{*}_{m} = \{s_{(m,k^{*})} = 1,\ s_{(m,k)} = 0,\ \forall k \neq k^{*} \text{ and } k, k^{*} \in \mathcal{K}_m\} \tag{20}$$

where $k^{*} = \arg\max_{k \in \mathcal{K}_m} \delta_{(m,k)}(Q_{(m,k)})$⁹, and

$$\delta_{(m,k)}(Q_{(m,k)}) = \tilde{V}_{(m,k)}(Q_{(m,k)}) - \tilde{V}_{(m,k)}\left((Q_{(m,k)} - U_{(m,k)})^{+}\right), \qquad U_{(m,k)} = \left\lfloor \tau W \log_2\left(1 + \frac{\xi \varphi_{(m,k)}}{I_{(m,k)}}\right) \right\rfloor$$

where $I_{(m,k)} = \sum_{n \neq m} p_n P^{n}_{\max} L^{n}_{(m,k)} H^{n}_{(m,k)} + N_0 W$
is the power sum of interference and noise, and
$\varphi_{(m,k)} = P^{m}_{\max} L^{m}_{(m,k)} H^{m}_{(m,k)}$ is the signal power. ■
Proof: Please refer to Appendix B. ■

8 Using the linear approximation in (18), we can address the curse of
dimensionality (complexity) as well as facilitate distributive implementation
where each BS could solve for $\tilde{V}_{(m,k)}(\tilde{Q}_{(m,k)})$ based on local CSI and QSI
only.

9 Note that $\delta_{(m,k)}(0) = 0, \forall k$, and hence the users with empty buffer will
not be scheduled and the activated BS $m$ will serve the users with non-empty
buffer (the chance for the buffer of all $K$ users being empty at a given slot
is very small).
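The per-BS scheduling rule of Corollary 1 can be sketched in Python as follows. The sketch rests on an assumption about the garbled metric: the scheduling metric of a user is read as the drop in its per-user post-decision value when its deliverable bits are served. The function name and the list-of-tables representation of the per-user value functions are also illustrative.

```python
def select_user(queues, deliverable_bits, value_tables):
    """Index of the user an active BS schedules in the current slot.

    queues[k] is the current queue length of user k, deliverable_bits[k] is
    the number of bits the channel could deliver to user k in this slot, and
    value_tables[k][q] is an estimate of user k's post-decision value at
    queue length q (all three are hypothetical inputs).
    """
    def delta(k):
        q = queues[k]
        q_post = max(q - deliverable_bits[k], 0)
        # Assumed metric: value reduction obtained by serving user k.
        return value_tables[k][q] - value_tables[k][q_post]

    return max(range(len(queues)), key=delta)
```

Under this reading, a user with an empty buffer has a metric of zero, so it is never preferred over a user whose value would strictly decrease when served.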
Fig. 2. The system procedure for distributive per-user value function and
per-user Q-factor online learning algorithm.



Remark 1 (Structure of the User Scheduling Actions): The
user scheduling action in (20) is a function of both local CSI
and QSI. Specifically, the number of bits to be delivered
$U_{(m,k)}$ is controlled by the local CSI $\mathbf{H}_{(m,k)}$, and the local QSI
$Q_{(m,k)}$ will determine $\delta_{(m,k)}(Q_{(m,k)})$. Each user estimates
$\varphi_{(m,k)}$ and $I_{(m,k)}$ in the preamble phase, and sends $U_{(m,k)}$
to the associated BS $m$ according to the process indicated
in Fig. 2. ■

C. ICI Management Control Policy on the QSI Time Scale

To determine the ICI management control policy, we define
the Q-factor as follows [11]:

$$\mathbb{Q}(\mathbf{Q}, \mathbf{p}) = \sum_{m,k} \beta_{(m,k)} f(Q_{(m,k)}) + \sum_{\mathbf{Q}'} \Pr\{\mathbf{Q}'|\mathbf{Q}, \mathbf{p}\} \min_{\mathbf{p}'} \mathbb{Q}(\mathbf{Q}', \mathbf{p}') - \theta \tag{21}$$

where $\Pr\{\mathbf{Q}'|\mathbf{Q}, \mathbf{p}\}$ is the transition probability from the current
QSI $\mathbf{Q}$ to $\mathbf{Q}'$, given the current action $\mathbf{p}$, and $\theta$ is a constant.
Note that the Q-factor $\mathbb{Q}(\mathbf{Q}, \mathbf{p})$ represents the potential cost of
applying a control action $\mathbf{p}$ at the current QSI $\mathbf{Q}$ and applying
the action $\arg\min_{\mathbf{p}'} \mathbb{Q}(\mathbf{Q}', \mathbf{p}')$ for any system state $\mathbf{Q}'$ in the
future. Similar to (18), we approximate the Q-factor in (21)
with a sum of per-user Q-factors, i.e.,

$$\mathbb{Q}(\mathbf{Q}, \mathbf{p}) \approx \sum_{m,k} \mathbb{Q}_{(m,k)}(Q_{(m,k)}, \mathbf{p}) \tag{22}$$

where $\mathbb{Q}_{(m,k)}$ is defined as the fixed point of the following
per-user fixed point equation:

$$\mathbb{Q}_{(m,k)}(Q_{(m,k)}, \mathbf{p}) = \beta_{(m,k)} f(Q_{(m,k)}) - \mathbb{Q}_{(m,k)}(Q^{I}_{(m,k)}, \mathbf{p}^{I}) + \sum_{Q'_{(m,k)}} \Pr\{Q'_{(m,k)}|Q_{(m,k)}, s_{(m,k)} = 1, \mathbf{p}\} \min_{\mathbf{p}'} \mathbb{Q}_{(m,k)}(Q'_{(m,k)}, \mathbf{p}') \tag{23}$$

where $\Pr\{Q'_{(m,k)}|Q_{(m,k)}, s_{(m,k)} = 1, \mathbf{p}\} = \mathbb{E}_{\mathbf{H}_{(m,k)}}\left[\Pr\{Q'_{(m,k)}|Q_{(m,k)}, \mathbf{H}_{(m,k)}, s_{(m,k)} = 1, \mathbf{p}\}\right]$.
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Before QSI Partition 



After QST Partition 



0.1.2 


3,4,5 


6,7 


8,9 



Ki = {y]=o,Qj = o) 

Km,= W, = 9,Qj = 9} 



n, = {Q, e [o,i, 2},Q,z {o,i,2D 
Km = (<Ji e (8,9},g a € {8,91) 



Fig. 3. Illustration of one possible way of local QSI partition. There are 
K = 2 users with buffer size Nq = 9, where each user's QSI is partitioned 
into 4 regions, given by {{0, 1, 2}; {3, 4, 5}; {6, 7}; {8, 9}}. Note that the 
number of local QSI regions for one BS is largely reduced from (Nq + 1) k = 
100 (TCi = {Qi = 0,Q 2 = 0},- - ,7eioo = {Qi = 9,Q 2 = 9}) to 
4 K = 16 (72i = {Qi e {0,1,2},Q 2 £ {0,1, 2}}, • ■ • ,K 16 = {Qi e 
{8,9},Q 2 £ {8,9}}) after partition. 



Q(m k) e {0'''' tNq} is a reference state and G V 
is a reference ICI management control pattern. The 
per-user Q-factor Qf mt k) is obtained by the proposed 
distributive online learning algorithm (explained in 
section IV-Dt . The BSC collects the per-BS Q-information 

Qm(p) = E (m , fc) Q( TO ,fc)(Q(m,*)>P) at the beginning of slot 
t, and the ICI management control policy is given by: 



p*=argmin V Q^(p) 



(24) 



In order to reduce the communication overhead between the 
M BSs and the BSC, we could further partition the local QSI 
space into N region^ (Qm = U^=i ^n) as illustrated in Fig. 
|3] At the beginning of the t-th slot, the m-th BS will update 
the BSC of the per-BS (Q-information if its QSI state belongs 
to a new region. Hence, the per-BS Q-information at the BSC 
is updated according to the following dynamics: 



{J2 Q(m,fe)( ( 3(m,fc) 



,p) if Q^G^Q* 
otherwise 
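The region bookkeeping behind Fig. 3 and the event-triggered reporting to the BSC can be sketched in a few lines. This is an illustrative sketch only: the function names (`user_region`, `local_region`, `count_regions`, `report_slots`) are hypothetical, and the reporting rule is our reading of the text, namely that a BS reports in the first slot and whenever its local QSI moves to a different region.

```python
from itertools import product

N_Q = 9  # buffer size used in Fig. 3
# Per-user partition from Fig. 3.
USER_REGIONS = [(0, 1, 2), (3, 4, 5), (6, 7), (8, 9)]


def user_region(q):
    """Return the index of the per-user region containing queue length q."""
    for idx, region in enumerate(USER_REGIONS):
        if q in region:
            return idx
    raise ValueError(f"queue length {q} is outside the buffer range")


def local_region(local_qsi):
    """Map a BS's local QSI (one queue length per user) to its region label."""
    return tuple(user_region(q) for q in local_qsi)


def count_regions(num_users):
    """Count local QSI states before and after partitioning."""
    states = product(range(N_Q + 1), repeat=num_users)
    before = (N_Q + 1) ** num_users
    after = len({local_region(s) for s in states})
    return before, after


def report_slots(qsi_trace):
    """Slots at which the BS would report to the BSC (assumed rule:
    the first slot, then every slot in which the region changes)."""
    slots, previous = [], None
    for t, local_qsi in enumerate(qsi_trace):
        region = local_region(local_qsi)
        if region != previous:
            slots.append(t)
            previous = region
    return slots
```

For $K = 2$ users, `count_regions(2)` reproduces the reduction from 100 states to 16 regions quoted in the caption of Fig. 3, and `report_slots` shows that small queue fluctuations inside a region trigger no signaling.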



Remark 2 ( Communication Overhead): The communica- 
tion overhead between the M BS and the BSC is reduced 
from 0((N Q + 1) MK + {N h ) m ' 2k ) (exponential growth w.r.t 
the number of users K) to 0(M(a)^) for some constant a 
(O(l) w.r.t. K), where N H is the cardinality of the CSI state 
space for one link. ■ 
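The selection rule in (24) amounts to a table lookup at the BSC once the per-BS Q-information has been collected. A minimal sketch, assuming each BS reports a dictionary from a (hypothetical) pattern label to its current $\mathbb{Q}_m(\mathbf{p})$ value:

```python
def select_ici_pattern(per_bs_q_info):
    """Pick the ICI management pattern that minimises the summed
    per-BS Q-information, in the spirit of (24).

    per_bs_q_info: list with one dict per BS, mapping a pattern label
    (hypothetical, e.g. "p1") to that BS's Q-information for the pattern.
    Returns (best_pattern, totals).
    """
    patterns = list(per_bs_q_info[0].keys())
    # Sum the reported values over all BSs for every candidate pattern.
    totals = {p: sum(q_m[p] for q_m in per_bs_q_info) for p in patterns}
    best = min(patterns, key=lambda p: totals[p])
    return best, totals
```

Because the BSC only sums $M$ reported numbers per pattern, its work grows linearly in the number of BSs for a fixed pattern set.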



D. Online Per-User Value Function and Per-User Q-factor Learning Algorithm

The system procedure for distributive online learning is given below:

• Initialization: Each BS initializes the per-user value function and Q-factor for its $K$ users, denoted as $\{V^0_{(m,k)}\}$ and $\{\mathbb{Q}^0_{(m,k)}\}$.

• ICI Management Control: At the beginning of the $t$-th slot, the BSC updates the Q-information $\mathbb{Q}^t_m(\mathbf{p})$ as in (25) and determines the ICI management pattern as in (24).

• User Scheduling: If $m \in \mathcal{M}_{\mathbf{p}}$, BS $m$ is selected to transmit. The user scheduling policy is determined according to (20).

• Local Per-User Value Function and Per-User Q-factor Update: Based on the current observations, each of the $M$ BSs updates the per-user value function $V_{(m,k)}$ and the per-user Q-factor $\mathbb{Q}_{(m,k)}$ according to Algorithm 1.

10 For example, one possible criterion is to partition the local QSI space so that the probability of $\mathbf{Q}_m$ belonging to any region is the same (uniform probability partitioning).

Fig. 2 illustrates the above procedure by a flowchart. The algorithm for the per-user value function and per-user Q-factor update is given below:

Algorithm 1 (Online Learning Algorithm): Let $\tilde{\mathbf{Q}}_m$ and $\mathbf{Q}_m$ be the current observations of the post-decision and pre-decision states respectively, $\mathbf{A}_m$ be the current observation of the new arrivals, $\{H_{(m,k)}\}_{k=1}^{K}$ be the current observation of the local CSI, and $\mathbf{p}$ be the realization of the ICI management control pattern. The online learning algorithm for user $k \in \mathcal{K}_m$ is given by

$$V^{t+1}_{(m,k)}(\tilde{Q}_{(m,k)}) = \begin{cases} V^{t}_{(m,k)}(\tilde{Q}_{(m,k)}) + \gamma(t)\big[\beta_{(m,k)} f(Q_{(m,k)}) + V^{t}_{(m,k)}(Q_{(m,k)} - U_{(m,k)}) - V^{t}_{(m,k)}(\tilde{Q}^I_{(m,k)}) - V^{t}_{(m,k)}(\tilde{Q}_{(m,k)})\big] & \text{if } \mathbf{p} = \mathbf{p}^I_m \\ V^{t}_{(m,k)}(\tilde{Q}_{(m,k)}) & \text{otherwise} \end{cases} \tag{26}$$

$$\mathbb{Q}^{t+1}_{(m,k)}(Q_{(m,k)}, \mathbf{p}) = \mathbb{Q}^{t}_{(m,k)}(Q_{(m,k)}, \mathbf{p}) + \gamma(t)\big[\beta_{(m,k)} f(Q_{(m,k)}) + \min_{\mathbf{p}'} \mathbb{Q}^{t}_{(m,k)}(Q_{(m,k)} - U_{(m,k)} + A_{(m,k)}, \mathbf{p}') - \mathbb{Q}^{t}_{(m,k)}(Q^I_{(m,k)}, \mathbf{p}^I) - \mathbb{Q}^{t}_{(m,k)}(Q_{(m,k)}, \mathbf{p})\big] \tag{27}$$

where $U_{(m,k)}$ is the number of bits to be delivered for user $k$ (given in Corollary 1, and depending indirectly on the local CSI observations $H_{(m,k)}$), $\{\tilde{Q}^I_{(m,k)}, \mathbf{p}^I_m\}$ and $\{Q^I_{(m,k)}, \mathbf{p}^I\}$ are the reference state and reference ICI management pattern for the value function $V_{(m,k)}$ in (19) and the Q-factor $\mathbb{Q}_{(m,k)}$ in (23) respectively, and $\gamma(n)$ is a diminishing positive step size sequence satisfying $\sum_n \gamma(n) = \infty$, $\sum_n \gamma^2(n) < \infty$. ■
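To make the two updates concrete, the sketch below implements one tabular step of each for a single user. It is a hedged illustration rather than the authors' implementation: the function names, the dictionary layout of the Q-factor table, the clipping of queue lengths to $[0, N_Q]$, and the choice $\gamma(n) = 1/n$ are all assumptions made here, and the condition under which the value function is updated follows our reading of (26).

```python
def step_size(n):
    """gamma(n) = 1/n: the sum diverges while the sum of squares converges."""
    return 1.0 / n


def update_value(V, q_post, q_pre, bits, beta, cost, q_ref, gamma,
                 pattern_is_reference=True):
    """One per-user value function step in the spirit of (26).

    V is a list indexed by post-decision queue length. The entry is only
    updated when the realised ICI pattern equals the reference pattern
    (assumed reading of the condition in (26)).
    """
    if not pattern_is_reference:
        return V[q_post]
    q_next_post = max(q_pre - bits, 0)  # next post-decision state, clipped at 0
    target = beta * cost(q_pre) + V[q_next_post] - V[q_ref]
    V[q_post] += gamma * (target - V[q_post])
    return V[q_post]


def update_q_factor(Qf, q_pre, pattern, bits, arrival, beta, cost,
                    q_ref, p_ref, gamma, n_q):
    """One per-user Q-factor step in the spirit of (27).

    Qf[q][p] is the Q-factor for pre-decision queue length q and pattern p.
    """
    q_next = min(max(q_pre - bits, 0) + arrival, n_q)  # next pre-decision state
    target = beta * cost(q_pre) + min(Qf[q_next].values()) - Qf[q_ref][p_ref]
    Qf[q_pre][pattern] += gamma * (target - Qf[q_pre][pattern])
    return Qf[q_pre][pattern]
```

Each step touches a single table entry and uses only quantities observed locally at the BS, which is what keeps the per-slot cost linear in the number of users.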

Remark 3 (Complexity of the Learning Algorithm): The proposed learning scheme only requires observations of the local QSI $\tilde{\mathbf{Q}}_m$ and $\mathbf{Q}_m$. Furthermore, each user only needs to feed back $U_{(m,k)}$ instead of the local CSI $\mathbf{H}_m$, which results in a feedback load similar to that of HSDPA systems. ■

E. Convergence Analysis 

In this section, we establish the convergence of the proposed per-user learning algorithm (Algorithm 1). We first define a mapping on the post-decision state $\tilde{Q}_{(m,k)}$ as

$$T_{(m,k)}(V_{(m,k)}, \tilde{Q}_{(m,k)}) = g_{(m,k)}(\tilde{Q}_{(m,k)}) + \sum_{\tilde{Q}'_{(m,k)}} \Pr\{\tilde{Q}'_{(m,k)} | \tilde{Q}_{(m,k)}, s_{(m,k)} = 1, \mathbf{p}^I_m\} V_{(m,k)}(\tilde{Q}'_{(m,k)}) \tag{28}$$

where $Q_{(m,k)} = \tilde{Q}_{(m,k)} + A_{(m,k)}$ is the pre-decision state, $g_{(m,k)}(\tilde{Q}_{(m,k)}) = E_{A_{(m,k)}}[\beta_{(m,k)} f(\tilde{Q}_{(m,k)} + A_{(m,k)})]$, and $\Pr\{\tilde{Q}'_{(m,k)} | \tilde{Q}_{(m,k)}, s_{(m,k)} = 1, \mathbf{p}^I_m\} = E_{H_{(m,k)}, A_{(m,k)}}[\Pr\{\tilde{Q}'_{(m,k)} | Q_{(m,k)}, H_{(m,k)}, s_{(m,k)} = 1, \mathbf{p}^I_m\}]$. The vector form of the mapping is given by:

$$\mathbf{T}_{(m,k)}(\mathbf{V}_{(m,k)}) = \mathbf{g}_{(m,k)} + \mathbf{P}_{(m,k)} \mathbf{V}_{(m,k)} \tag{29}$$

where $\mathbf{P}_{(m,k)}$ is the $(N_Q + 1) \times (N_Q + 1)$ transition matrix for the post-decision state queue of user $k$, and $\mathbf{g}_{(m,k)}$ and $\mathbf{V}_{(m,k)}$ are $(N_Q + 1) \times 1$ vectors. Specifically, we have the following lemma for the per-user value function learning in (26).

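Lemma 1 (Convergence of Per-User Value Function): The update of the per-user value function $\mathbf{V}^t_{(m,k)}$ converges almost surely under the proposed learning algorithm (Algorithm 1), i.e., $\lim_{t \to \infty} \mathbf{V}^t_{(m,k)} = \mathbf{V}^{\infty}_{(m,k)}, \forall k, m$, and $V^{\infty}_{(m,k)}(\tilde{Q}_{(m,k)})$ is a monotonic increasing function satisfying:

$$\mathbf{V}^{\infty}_{(m,k)} + V^{\infty}_{(m,k)}(\tilde{Q}^I_{(m,k)}) \mathbf{e} = \mathbf{T}_{(m,k)}(\mathbf{V}^{\infty}_{(m,k)}) \tag{30}$$

Proof: Please refer to Appendix C. ■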
Note that (30) is equivalent to the per-user fixed point equation in (19). This result illustrates that the proposed online distributive learning in (26) converges to the target per-user fixed point solution in (19). We define a mapping for the per-user Q-factor $\mathbb{Q}_{(m,k)}$ as

$$T^{\mathbb{Q}}_{(m,k)}(\mathbb{Q}_{(m,k)}, Q_{(m,k)}, \mathbf{p}) = \beta_{(m,k)} f(Q_{(m,k)}) + \sum_{Q'_{(m,k)}} \Pr\{Q'_{(m,k)} | Q_{(m,k)}, s_{(m,k)} = 1, \mathbf{p}\} \min_{\mathbf{p}'} \mathbb{Q}_{(m,k)}(Q'_{(m,k)}, \mathbf{p}') \tag{31}$$

Specifically, we have the following lemma for the Q-factor online learning in (27).

Lemma 2 (Convergence of the Per-User Q-factor): The update of the per-user Q-factor $\mathbb{Q}^t_{(m,k)}$ converges almost surely under the proposed learning algorithm (Algorithm 1), i.e., $\lim_{t \to \infty} \mathbb{Q}^t_{(m,k)} = \mathbb{Q}^{\infty}_{(m,k)}, \forall k, m$, where the steady-state Q-factors $\{\mathbb{Q}^{\infty}_{(m,k)}\}$ satisfy:

$$\mathbb{Q}^{\infty}_{(m,k)}(Q_{(m,k)}, \mathbf{p}) + \mathbb{Q}^{\infty}_{(m,k)}(Q^I_{(m,k)}, \mathbf{p}^I) = T^{\mathbb{Q}}_{(m,k)}(\mathbb{Q}^{\infty}_{(m,k)}, Q_{(m,k)}, \mathbf{p}) \tag{32}$$

Proof: Please refer to Appendix D. ■

Note that (32) is equivalent to the per-user fixed point equation for $\mathbb{Q}_{(m,k)}$ in (23). This result illustrates that the proposed online distributive learning in (27) converges to the target per-user fixed point solution in (23).
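The fixed point targeted by Lemma 1 can be checked numerically on a toy example. The sketch below iterates $\mathbf{V} \leftarrow \mathbf{g} + \mathbf{P}\mathbf{V} - V(\text{ref})\mathbf{e}$ on a three-state birth-death chain; the chain, the cost vector and the function names are illustrative assumptions, not values from the paper.

```python
def apply_mapping(V, g, P):
    """Vector form of the mapping in (29): T(V) = g + P V."""
    n = len(V)
    return [g[i] + sum(P[i][j] * V[j] for j in range(n)) for i in range(n)]


def solve_fixed_point(g, P, ref=0, iterations=200):
    """Iterate V <- T(V) - V[ref] e, whose limit satisfies V + V[ref] e = T(V)."""
    V = [0.0] * len(g)
    for _ in range(iterations):
        TV = apply_mapping(V, g, P)
        V = [tv - V[ref] for tv in TV]  # V[ref] is the value before this sweep
    return V


def residual(V, g, P, ref=0):
    """Largest violation of the fixed point equation V + V[ref] e = T(V)."""
    TV = apply_mapping(V, g, P)
    return max(abs(V[i] + V[ref] - TV[i]) for i in range(len(V)))


# Toy post-decision queue chain with three states (assumed for illustration).
P_TOY = [[0.5, 0.5, 0.0],
         [0.25, 0.5, 0.25],
         [0.0, 0.5, 0.5]]
G_TOY = [0.0, 1.0, 2.0]
```

For this toy chain the iteration settles at $\mathbf{V} = (1, 3, 5)$: the value at the reference state equals the average cost per slot, and the solution is monotonically increasing in the queue length, matching the structure stated in Lemma 1.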

Lemmas 1 and 2 only establish the convergence of the proposed online learning algorithm. Strictly speaking, the converged result is not optimal because of the linear approximation of the value function $V(\mathbf{Q})$ and the Q-factor $\mathbb{Q}(\mathbf{Q}, \mathbf{p})$ in (18) and (22) respectively. The linear approximation is needed for distributive implementation. As illustrated in Fig. 4, the proposed distributive solution has close-to-optimal performance compared with the brute-force centralized solution of the Bellman equation in (9).




[Figure 4: horizontal axis "Packet Arrival Rate (Bits/Slot)", ranging from 0.2 to 0.45.]

Fig. 4. Average delay per user versus per-user loading $\lambda_{(m,k)}$ in Example 1, with the source arrival model given by $\Pr\{A_{(m,k)} = 1\} = \lambda_{(m,k)}$ and $\Pr\{A_{(m,k)} = 0\} = 1 - \lambda_{(m,k)}$ for all $m, k$, and the buffer size $N_Q = 3$. Centralized optimal solution refers to the brute-force centralized solution of the Bellman equation in (5). Baseline 1 refers to the CSIT only scheme, where the user scheduling is adaptive to the CSIT only. Baseline 2 refers to the Dynamic Backpressure scheme [20]. Baseline 3 refers to the time-scale decomposition scheme proposed in [2].

VI. Simulation and Discussion 

In this section, we compare the proposed distributive queue-aware intra-cell user scheduling and ICI management control scheme with three baselines. Baseline 1 refers to the CSIT only scheme, where the user scheduling is adaptive to the CSIT only so as to optimize the achievable data rate. Baseline 2 refers to a throughput-optimal policy (in the stability sense) for the user scheduling, namely the Dynamic Backpressure scheme [20]. In both Baselines 1 and 2, the traditional frequency reuse scheme (frequency reuse factor equal to 3) is used for inter-cell interference management. Baseline 3 refers to the time-scale decomposition scheme proposed in [2], where the set of possible ICI management patterns $\mathcal{P}$ is the same as in the proposed scheme. In the simulation, we consider a two-tier cellular network composed of 19 BSs as in 0, each with a coverage of 500 m. Channel models are implemented according to the Urban Macrocell Model in 3GPP and Jakes' Rayleigh fading model. Specifically, the path loss model is given by $PL = 34.5 + 35 \log_{10}(r)$, where $r$ (in m) is the distance from the transmitter to the receiver. The total bandwidth is 10 MHz. We consider Poisson packet arrivals with average arrival rate $E[A_{(m,k)}] = \lambda_{(m,k)}$ (packets/slot) and exponentially distributed random packet size $N_{(m,k)}$ with $E[N_{(m,k)}] = 5$ Mbits. The scheduling slot duration $\tau$ is 5 ms. The maximum buffer size $N_Q$ is 9 (in packets), where each user's QSI is partitioned into 4 regions, given by {{0, 1, 2}; {3, 4, 5}; {6, 7}; {8, 9}}.
The cost function is given by $f(Q_{(m,k)}) = \frac{Q_{(m,k)}}{\lambda_{(m,k)}}$ for all the users in the simulations.
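As a quick numerical check of the path loss model quoted above, the following sketch evaluates $PL = 34.5 + 35 \log_{10}(r)$ and the resulting mean received power. The helper names are hypothetical, and shadowing, fast fading and antenna gains are deliberately ignored here.

```python
import math


def path_loss_db(distance_m):
    """Path loss in dB at distance r (metres): PL = 34.5 + 35 log10(r)."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return 34.5 + 35.0 * math.log10(distance_m)


def received_power_dbm(tx_power_dbm, distance_m):
    """Mean received power in dBm, ignoring fading and antenna gains."""
    return tx_power_dbm - path_loss_db(distance_m)
```

For instance, a user 100 m from a BS transmitting at 30 dBm sees a path loss of 104.5 dB, i.e. a mean received power of -74.5 dBm; at the 500 m cell edge the loss grows by a further $35 \log_{10}(5) \approx 24.5$ dB.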

A. Performance w.r.t. Transmit Power 

Fig. 5 and Fig. 6 illustrate the average delay and the packet dropping probability (conditioned on packet arrival) per user versus the transmit power $P^{\max}$, respectively. The number of users per BS is $K = 3$, and the average arrival rate is $\lambda_{(m,k)} = 1$. Note that the average delay and packet dropping probability of all the schemes decrease as the transmit power increases, and that the proposed scheme achieves a significant performance gain over all baselines. This gain is contributed by the QSI-aware user scheduling as well as the ICI management control.

Fig. 5. Average delay per user versus transmit power $P^{\max}$. The number of users per BS is $K = 3$. The average arrival rate is $\lambda_{(m,k)} = 1$ (packets/slot). The maximum buffer size $N_Q$ is 9, where each user's QSI is partitioned into 4 regions, given by {{0, 1, 2}; {3, 4, 5}; {6, 7}; {8, 9}}.

Fig. 6. Packet dropping probability (conditioned on packet arrival) per user versus transmit power $P^{\max}$. The number of users per BS is $K = 3$. The average arrival rate is $\lambda_{(m,k)} = 1$ (packets/slot). The maximum buffer size $N_Q$ is 9, where each user's QSI is partitioned into 4 regions, given by {{0, 1, 2}; {3, 4, 5}; {6, 7}; {8, 9}}.

Fig. 7. Average delay per user versus per-user loading $\lambda_{(m,k)}$. The transmit power is $P^{\max} = 30$ dBm. The number of users per BS is $K = 3$. The maximum buffer size $N_Q$ is 9, where each user's QSI is partitioned into 4 regions, given by {{0, 1, 2}; {3, 4, 5}; {6, 7}; {8, 9}}.

Fig. 8. Cumulative Distribution Function (CDF) of the queue length per user with transmit power $P^{\max} = 25$ dBm. The number of users per BS is $K = 3$. The average arrival rate is $\lambda_{(m,k)} = 1$. The maximum buffer size $N_Q$ is 9, where each user's QSI is partitioned into 4 regions, given by {{0, 1, 2}; {3, 4, 5}; {6, 7}; {8, 9}}.

B. Performance w.r.t. Loading 

Fig. 7 illustrates the average delay versus per-user loading (average arrival rate $\lambda_{(m,k)}$) at a transmit power of $P^{\max} = 30$ dBm with $K = 3$ users per BS. It can also be observed that the proposed scheme achieves a significant gain over all the baselines across a wide range of input loading.

C. Cumulative Distribution Function (CDF) of the Queue Length

Fig. 8 illustrates the Cumulative Distribution Function (CDF) of the queue length per user with transmit power $P^{\max} = 25$ dBm. The number of users per BS is $K = 3$ and the average arrival rate is $\lambda_{(m,k)} = 1$. It can also be verified that the proposed scheme achieves not only a smaller average delay but also a smaller delay percentile compared with the other baselines.

[Figure 9: horizontal axis "Scheduling Slot Index".]

Fig. 9. Convergence property of the proposed distributive stochastic learning algorithm. The transmit power is $P^{\max} = 35$ dBm. The number of users per BS is $K = 3$. The average arrival rate is $\lambda_{(m,k)} = 1.5$. The maximum buffer size $N_Q$ is 9, where each user's QSI is partitioned into 4 regions, given by {{0, 1, 2}; {3, 4, 5}; {6, 7}; {8, 9}}. The figure illustrates the instantaneous per-user value function $V_{(1,1)}(Q_{(1,1)})$ and Q-factor $\mathbb{Q}_{(1,1)}(Q_{(1,1)}, \mathbf{p})$ versus the instantaneous slot index. The boxes indicate the average delay of various schemes at three selected slot indices.



D. Convergence Performance 

Fig. 9 illustrates the average delay per user versus the scheduling slot index with transmit power $P^{\max} = 35$ dBm. The number of users per BS is $K = 3$ and the average arrival rate is $\lambda_{(m,k)} = 1.5$. It can be observed that the online algorithm converges quite fast. For example, the delay performance of the proposed scheme already outperforms all the baselines at the 400-th slot. Furthermore, the delay performance at the 400-th slot is already quite close to the converged average delay. Finally, unlike the conventional iterative NUM approach, where the iterations are done offline within the coherence time of the CSI, the proposed iterative algorithm is updated over the same time scale as the CSI and QSI updates. Moreover, the iterative algorithm is online, meaning that useful payload is transmitted during the iterations.



VII. Summary 

In this paper, we study a distributive queue-aware intra-cell user scheduling and inter-cell interference management control design for a delay-optimal cellular downlink system. We first model the problem as an infinite horizon average cost POMDP, which is NP-hard in general. By exploiting the special problem structure, we derive an equivalent Bellman equation to solve the POMDP problem. To address the distributive requirement and the issue of dimensionality and computation complexity, we derive a distributive online stochastic learning algorithm, which only requires local QSI and local CSI at each of the $M$ BSs. We show that the proposed learning algorithm converges almost surely and has significant gain compared with various baselines. The proposed algorithm only has linear complexity order $O(MK)$.



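Appendix A: Proof of Theorem 1

Based on the action partitioning, we can associate the MDP formulation with our delay-optimal control problem as follows:

• State Space: The system state of the MDP is the global QSI $\mathbf{Q} \in \mathcal{Q}$.

• Action Space: The action on the system state $\mathbf{Q}$ is the partitioned action $\Omega(\mathbf{Q})$ given in Definition 3, and the action space is $\{\mathcal{P}, \mathcal{S}\}$.

• Transition Kernel: The transition kernel is $\Pr\{\mathbf{Q}' | \mathbf{Q}, \Omega(\mathbf{Q})\} = E_{\mathbf{H}}[\Pr\{\mathbf{Q}' | \mathbf{Q}, \mathbf{H}, \Omega(\chi)\}]$, where $\Pr\{\mathbf{Q}' | \mathbf{Q}, \mathbf{H}, \Omega(\chi)\}$ is given by ©.

• Per-Slot Cost: The per-slot cost function is $g(\mathbf{Q}, \Omega(\mathbf{Q})) = E_{\mathbf{H}}[g(\mathbf{Q}, \mathbf{H}, \Omega(\chi))] = \sum_{m,k} \beta_{(m,k)} f(Q_{(m,k)})$.

Therefore, the optimal partitioned action $\Omega^*(\mathbf{Q})$ can be determined from the equivalent Bellman equation in (9).

Next, we prove that $V(\mathbf{Q})$ is a monotonic increasing function w.r.t. each of its components. Given that $V^l(\mathbf{Q})$ is the result of the $l$-th iteration, $V^{l+1}(\mathbf{Q})$ is given by:

$$V^{l+1}(\mathbf{Q}) = T(V^l, \mathbf{Q}) - T(V^l, \mathbf{Q}^I) \tag{33}$$

where $T(V, \mathbf{Q}) = \min_{\Omega(\mathbf{Q})} \big[ g(\mathbf{Q}, \Omega(\mathbf{Q})) + \sum_{\mathbf{Q}'} \Pr\{\mathbf{Q}' | \mathbf{Q}, \Omega(\mathbf{Q})\} V(\mathbf{Q}') \big]$, and $\mathbf{Q}^I$ is a reference state. Because $\lim_{l \to \infty} V^l(\mathbf{Q}) = V(\mathbf{Q})$ [7], it is sufficient to prove that $V^l(\mathbf{Q}), \forall l$, is component-wise monotonic increasing. Using induction, we start from $V^0(\mathbf{Q}) = 0, \forall \mathbf{Q}$. In the induction step, we assume that $\forall \mathbf{Q}^1 \succeq \mathbf{Q}^2$, $V^l(\mathbf{Q}^1) \geq V^l(\mathbf{Q}^2)$, and we get

$$\begin{aligned} V^{l+1}(\mathbf{Q}^1) + T(V^l, \mathbf{Q}^I) &= \min_{\Omega(\mathbf{Q}^1)} \Big[ g(\mathbf{Q}^1, \Omega(\mathbf{Q}^1)) + \sum_{\mathbf{Q}'} \Pr\{\mathbf{Q}' | \mathbf{Q}^1, \Omega(\mathbf{Q}^1)\} V^l(\mathbf{Q}') \Big] \\ &\geq \sum_{m,k} \beta_{(m,k)} f(Q^2_{(m,k)}) + \sum_{\mathbf{A}} \Pr\{\mathbf{A}\} E_{\mathbf{H}}\big[V^l(\mathbf{Q}^2 - \mathbf{U}^* + \mathbf{A})\big] \\ &\geq \min_{\Omega(\mathbf{Q}^2)} \Big[ g(\mathbf{Q}^2, \Omega(\mathbf{Q}^2)) + \sum_{\mathbf{Q}'} \Pr\{\mathbf{Q}' | \mathbf{Q}^2, \Omega(\mathbf{Q}^2)\} V^l(\mathbf{Q}') \Big] \\ &= V^{l+1}(\mathbf{Q}^2) + T(V^l, \mathbf{Q}^I) \end{aligned} \tag{34}$$

where $\mathbf{U}^*$ is the vector of delivered bits for all users under the conditional action $\Omega^*(\mathbf{Q}^1) = \{\mathbf{p}^*, \mathbf{s}^*\}$. Specifically, $U^*_{(m,k)}(t) = R_{(m,k)}(\mathbf{H}, \mathbf{p}^*, \mathbf{s}^*) \tau$.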

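Appendix B: Proof of Corollary 1

Using the linear approximation in (18) and the given ICI management pattern $\mathbf{p}$, the optimal user scheduling action $\mathbf{s}$ (obtained by solving the RHS of the Bellman equation (11)) is:

$$\begin{aligned} &\min_{\mathbf{s}(\mathbf{Q}, \mathbf{H}) \in \mathcal{S}} \Big[ g(\mathbf{Q}, \mathbf{p}, \mathbf{s}(\mathbf{Q}, \mathbf{H})) + \sum_{\mathbf{Q}'} \Pr\{\mathbf{Q}' | \mathbf{Q}, \mathbf{p}, \mathbf{s}(\mathbf{Q}, \mathbf{H})\} V(\mathbf{Q}') \Big] \\ \Leftrightarrow\; &\min_{\mathbf{s}(\mathbf{Q}, \mathbf{H}) \in \mathcal{S}} \sum_{m,k} \Big( V_{(m,k)}(Q_{(m,k)})(1 - s_{(m,k)}) + V_{(m,k)}\big((Q_{(m,k)} - U_{(m,k)})^+\big) s_{(m,k)} \Big) \\ \Leftrightarrow\; &\max_{\mathbf{s}_m \in \mathcal{S}_m} \sum_{k} \Big( V_{(m,k)}(Q_{(m,k)}) - V_{(m,k)}\big((Q_{(m,k)} - U_{(m,k)})^+\big) \Big) s_{(m,k)}, \quad \forall m \in \mathcal{M}_{\mathbf{p}} \end{aligned} \tag{35}$$

where $\mathcal{S}_m = \{\mathbf{s}_m : \sum_{k \in \mathcal{K}_m} s_{(m,k)} = 1, s_{(m,k)} \in \{0, 1\}\}$ is the set of all possible user scheduling policies for BS $m$. As a result, Corollary 1 follows directly from the above equation.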
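The last line of (35) says that an active BS serves the user whose value function drops the most when its bits are delivered. A minimal sketch of that rule, assuming each per-user value function is stored as a list indexed by queue length (the function name and data layout are illustrative):

```python
def schedule_user(value_functions, queue_lengths, deliverable_bits):
    """Pick the user maximising V(Q) - V((Q - U)^+) at one active BS.

    value_functions: one list per user, indexed by queue length.
    queue_lengths: current queue length Q of every user.
    deliverable_bits: U for every user, reported from its local CSI.
    Returns (index of the scheduled user, list of per-user gains).
    """
    gains = []
    for V, q, u in zip(value_functions, queue_lengths, deliverable_bits):
        q_after = max(q - u, 0)  # (Q - U)^+
        gains.append(V[q] - V[q_after])
    best = max(range(len(gains)), key=lambda k: gains[k])
    return best, gains
```

With a convex value function such as $V(q) = q^2$, a moderately loaded user with a good channel can win over a heavily loaded user with a poor channel, which is the queue-aware and channel-aware trade-off the corollary describes. A user with an empty buffer always has zero gain and is therefore never preferred over a backlogged one.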



Appendix C: Proof of Lemma 1

From the definition of the mapping $T_{(m,k)}$ in (28), the convergence property of the per-user value function update algorithm in (26) is equivalent to that of the following update equation [21]:

$$V^{t+1}_{(m,k)}(\tilde{Q}_{(m,k)}) = V^{t}_{(m,k)}(\tilde{Q}_{(m,k)}) + \gamma(t)\big[ T_{(m,k)}(V^{t}_{(m,k)}, \tilde{Q}_{(m,k)}) - V^{t}_{(m,k)}(\tilde{Q}^I_{(m,k)}) - V^{t}_{(m,k)}(\tilde{Q}_{(m,k)}) + Z^{t+1}_{(m,k)} \big] \tag{36}$$

where $Z^{t+1}_{(m,k)} = \beta_{(m,k)} f(Q_{(m,k)}) + V^{t}_{(m,k)}(\tilde{Q}'_{(m,k)}) - T_{(m,k)}(V^{t}_{(m,k)}, \tilde{Q}_{(m,k)})$, and $\tilde{Q}'_{(m,k)} = \tilde{Q}_{(m,k)} + A_{(m,k)} - U_{(m,k)}$. $U_{(m,k)}$ is determined by the ICI management control pattern and the local CSI $H_{(m,k)}$.

Let $F_t = \sigma(\mathbf{V}^l_{(m,k)}, \mathbf{Z}^l_{(m,k)}, l \leq t)$ be the $\sigma$-algebra generated by $\{\mathbf{V}^l_{(m,k)}, \mathbf{Z}^l_{(m,k)}, l \leq t\}$. It can be verified that $E_{\{H_{(m,k)}, A_{(m,k)}\}}[\mathbf{Z}^{t+1}_{(m,k)} | F_t] = 0$, and $E_{\{H_{(m,k)}, A_{(m,k)}\}}[\|\mathbf{Z}^{t+1}_{(m,k)}\|^2 | F_t] \leq C_1 (1 + \|\mathbf{V}^{t}_{(m,k)}\|^2)$ for a suitable constant $C_1$. Therefore, the learning algorithm in (36) is a standard stochastic learning algorithm with the martingale difference noise $\mathbf{Z}^{t+1}_{(m,k)}$. We use the ordinary differential equation (ODE) method to analyze the convergence. Specifically, the limiting ODE that (36) tracks asymptotically is given by:

$$\dot{\mathbf{V}}_{(m,k)}(t) = \mathbf{T}_{(m,k)}(\mathbf{V}_{(m,k)}(t)) - \mathbf{V}_{(m,k)}(t) - V_{(m,k)}(\tilde{Q}^I_{(m,k)}, t) \mathbf{e} \tag{37}$$

Note that there is a unique fixed point $\mathbf{V}^*_{(m,k)}$ that satisfies the Bellman equation

$$\mathbf{T}_{(m,k)}(\mathbf{V}^*_{(m,k)}) - \mathbf{V}^*_{(m,k)} - V^*_{(m,k)}(\tilde{Q}^I_{(m,k)}) \mathbf{e} = 0 \tag{38}$$
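The structure of (36), a fixed point iteration driven by a diminishing step size and corrupted by zero-mean noise, can be illustrated on a scalar toy problem. The sketch below is not the per-user update itself: the contraction `mapping`, the uniform noise, and the step size $\gamma(n) = 1/n$ are assumptions chosen only to show the mechanism.

```python
import random


def step_size(n):
    """gamma(n) = 1/n satisfies sum gamma = inf and sum gamma^2 < inf."""
    return 1.0 / n


def mapping(x):
    """Toy mapping with the unique fixed point x* = 2."""
    return 0.5 * x + 1.0


def noisy_fixed_point(num_steps, seed=0):
    """Iterate x <- x + gamma(n) * (T(x) - x + Z), with zero-mean noise Z."""
    rng = random.Random(seed)
    x = 0.0
    for n in range(1, num_steps + 1):
        noise = rng.uniform(-1.0, 1.0)  # plays the role of the martingale difference term
        x += step_size(n) * (mapping(x) - x + noise)
    return x
```

Because the step sizes are square-summable, the accumulated noise averages out, while their divergent sum still lets the iterate travel all the way to the fixed point of the noiseless recursion, which is the equilibrium of the corresponding ODE $\dot{x} = T(x) - x$.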



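Appendix D: Proof of Lemma 2

From the definition of the mapping $T^{\mathbb{Q}}_{(m,k)}(\mathbb{Q}_{(m,k)}, Q_{(m,k)}, \mathbf{p})$ in (31), we define the vector-form mapping $\mathbf{T}^{\mathbb{Q}}_{(m,k)}(\mathbb{Q}_{(m,k)}) : \mathbb{R}^{(1+N_Q) \times |\mathcal{P}|} \to \mathbb{R}^{(1+N_Q) \times |\mathcal{P}|}$, where each element is given by $T^{\mathbb{Q}}_{(m,k)}(\mathbb{Q}_{(m,k)}, Q_{(m,k)}, \mathbf{p})$. The convergence property of the per-user Q-factor update algorithm in (27) is equivalent to that of the following update equation [21]:

$$\mathbb{Q}^{t+1}_{(m,k)} = \mathbb{Q}^{t}_{(m,k)} + \gamma(t)\big[ \mathbf{T}^{\mathbb{Q}}_{(m,k)}(\mathbb{Q}^{t}_{(m,k)}) - \mathbb{Q}^{t}_{(m,k)}(Q^I_{(m,k)}, \mathbf{p}^I) \mathbf{e} - \mathbb{Q}^{t}_{(m,k)} + \mathbf{Z}^{t+1}_{(m,k)} \big] \tag{39}$$

where $\mathbf{Z}^{t+1}_{(m,k)}$ is the vector form of $Z^{t+1}_{(m,k)}(Q_{(m,k)}, \mathbf{p}) = \beta_{(m,k)} f(Q_{(m,k)}) + \min_{\mathbf{p}'} \mathbb{Q}^{t}_{(m,k)}(Q'_{(m,k)}, \mathbf{p}') - T^{\mathbb{Q}}_{(m,k)}(\mathbb{Q}^{t}_{(m,k)}, Q_{(m,k)}, \mathbf{p})$, and $Q'_{(m,k)} = Q_{(m,k)} - U_{(m,k)} + A_{(m,k)}$. $U_{(m,k)}$ is determined by the ICI management control pattern $\mathbf{p}$ and the local CSI $H_{(m,k)}$.

Let $F_t = \sigma(\mathbb{Q}^l_{(m,k)}, \mathbf{Z}^l_{(m,k)}, l \leq t)$ be the $\sigma$-algebra generated by $\{\mathbb{Q}^l_{(m,k)}, \mathbf{Z}^l_{(m,k)}, l \leq t\}$. It can be verified that $E_{\{H_{(m,k)}, A_{(m,k)}\}}[\mathbf{Z}^{t+1}_{(m,k)} | F_t] = 0$, and $E_{\{H_{(m,k)}, A_{(m,k)}\}}[\|\mathbf{Z}^{t+1}_{(m,k)}\|^2 | F_t] \leq C_1 (1 + \|\mathbb{Q}^{t}_{(m,k)}\|^2)$ for a suitable constant $C_1$. Therefore, the learning algorithm in (39) is also a standard stochastic learning algorithm with the martingale difference noise $\mathbf{Z}^{t+1}_{(m,k)}$. The limiting ODE that (39) tracks asymptotically is given by:

$$\dot{\mathbb{Q}}_{(m,k)}(t) = \mathbf{T}^{\mathbb{Q}}_{(m,k)}(\mathbb{Q}_{(m,k)}(t)) - \mathbb{Q}_{(m,k)}(t) - \mathbb{Q}_{(m,k)}(Q^I_{(m,k)}, \mathbf{p}^I, t) \mathbf{e} \tag{40}$$

Furthermore, there is a unique fixed point $\mathbb{Q}^*_{(m,k)}$ satisfying the following equation [24]:

$$\mathbf{T}^{\mathbb{Q}}_{(m,k)}(\mathbb{Q}^*_{(m,k)}) - \mathbb{Q}^*_{(m,k)} - \mathbb{Q}^*_{(m,k)}(Q^I_{(m,k)}, \mathbf{p}^I) \mathbf{e} = 0 \tag{41}$$

and it is proved in [24] that $\mathbb{Q}^*_{(m,k)}$ is the globally asymptotically stable equilibrium for (40). As a result, following the same argument as in the convergence proof of the per-user value function in Lemma 1, we can conclude that the iterates of the update satisfy $\mathbb{Q}^t_{(m,k)} \to \mathbb{Q}^*_{(m,k)}$ almost surely.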